"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin." The Bitter Lesson, Rich Sutton, 2019.

Embeddings and Pretrained networks¶

GloVe¶

GloVe dates from 2014. All relevant information can be found on the project site hosted at Stanford. The algorithm is rather simple:
"The GloVe model is trained on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus."
The basic idea applies to word2vec as well as to GloVe:

This is done with so-called matrix factorization; the matrix is the co-occurrence matrix of words in documents.
The example below is a typical example of collaborative filtering. Matrix factorization became very popular in the recommender-system community due to the $1-million Netflix Prize.
At the start, each word (item, user) is given a random embedding vector. The scalar product (dot product) between two vectors is asked to reconstruct the content of the respective cell. The error, i.e. the difference between the scalar product and the actual content of the cell, is propagated back into the embedding vectors, which are adapted accordingly.
The resulting embeddings are able to reconstruct the word co-occurrence matrix (or the ratings a user gave a certain movie).
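This idea can be sketched in a few lines of NumPy. The toy co-occurrence matrix, embedding size, and learning rate below are made up for illustration; it is not the full GloVe objective (which also uses biases and a weighting function), just the bare factorization idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy word-word co-occurrence counts for 4 "words" (symmetric, zero diagonal)
C = np.array([[0., 3., 1., 0.],
              [3., 0., 2., 0.],
              [1., 2., 0., 4.],
              [0., 0., 4., 0.]])

dim, lr = 2, 0.02
W = rng.normal(scale=0.1, size=(4, dim))  # a random embedding vector per word

for _ in range(10_000):
    for i in range(4):
        for j in range(4):
            if C[i, j] == 0:                  # GloVe-style: fit only non-zero cells
                continue
            err = W[i] @ W[j] - C[i, j]       # dot product should match the count
            gi, gj = err * W[j], err * W[i]   # gradient of 0.5 * err**2
            W[i] -= lr * gi
            W[j] -= lr * gj

print(np.round(W @ W.T, 2))   # the non-zero cells of C are reconstructed
```

After training, `W[i] @ W[j]` matches the non-zero counts, which is exactly the "embeddings reconstruct the matrix" claim above.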

Word2Vec¶

Word2Vec paper

CBOW: take the embeddings of the surrounding words and try to predict the masked (missing) word in the middle.

Skip-Gram: Take the embedding of the word in the middle and try to predict the words around it.
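The difference between the two objectives is easiest to see in the training pairs they generate. A toy sketch (the sentence and window size are made up):

```python
# Build (context, target) training pairs for CBOW and skip-gram
# from a toy sentence, window size 2.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, target in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window),
                              min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))    # context words -> predict middle word
    for c in context:
        skipgram_pairs.append((target, c))  # middle word -> predict each context word

print(cbow_pairs[2])      # (['the', 'quick', 'fox', 'jumps'], 'brown')
```

CBOW averages the context embeddings into one prediction; skip-gram turns the same window into several one-to-one prediction tasks.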

FastText¶

FastText paper. A more approachable explanation can be found here:
While GloVe and Word2Vec work on the word level, FastText works on the character n-gram level. In this way it learns the internal structure of words. Thus, FastText has no out-of-vocabulary words (words not present during training) and is able to learn similarities between word stems.
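A sketch of how FastText decomposes a word into character n-grams. The boundary markers `<`/`>` and the usual 3-to-6 n-gram range follow the paper; the function itself is our illustration:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style subword units: the word is wrapped in boundary
    markers '<' and '>' and split into all character n-grams."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]   # the full word itself is also kept as a unit

print(char_ngrams("where", 3, 3))
```

A word's vector is the sum of its n-gram vectors, so even an unseen word like "whereabouts" shares most of its n-grams (and hence most of its vector) with "where".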

BERT embeddings¶

BERT is a classical transformer encoder. Instead of predicting the next word in a sentence (as done with recurrent neural networks such as LSTMs), it predicts the masked words (ca. 15%). The information of the words that are present is shared among all positions in the network.
The [CLS] token signals the beginning of a new sentence (its embedding is often used for sentence classification). The [MASK] token marks the places where the correct word has to be guessed; [PAD] just fills all input sentences to the same length, which is more efficient since sentences can be batched together.
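A minimal sketch of how such a masked-language-modeling input is built. BERT picks roughly 15% of the real tokens at random; here one position is masked by hand to keep the example deterministic:

```python
tokens = "[CLS] the cat sat on the mat [PAD] [PAD]".split()

# BERT masks ~15% of the non-special tokens at random; we fix the choice here.
mask_positions = {3}                                  # mask "sat"
targets = {p: tokens[p] for p in mask_positions}      # what the model must predict
masked = ["[MASK]" if p in mask_positions else t
          for p, t in enumerate(tokens)]

print(" ".join(masked))   # [CLS] the cat [MASK] on the mat [PAD] [PAD]
print(targets)            # {3: 'sat'}
```

The loss is computed only at the masked positions; all other positions just provide context through self-attention.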

Sentence-Transformers¶

The initial paper

The classification head on the left is used during training. For inference, the cosine similarity between the output embeddings of sentence A and sentence B is computed (right side).

Triplet-Loss¶

In each training step, there is an anchor sentence and a positive example that is semantically equivalent to the anchor sentence. Moreover, there are negative examples, which are just random sentences not similar to the anchor sentence.
The so-called triplet-loss function pushes the anchor and the positive embeddings closer to each other (cosine of 1) and pushes the anchor and the negative embeddings, as well as the positive and the negative embeddings, further away from each other (Euclidean or cosine distance).

\begin{equation*} L = \text{max}\left(\sum_i^N (f_i^a - f_i^p)^2 - \sum_i^N (f_i^a - f_i^n)^2 + \alpha,\; 0\right) \end{equation*}

where $i$ runs over the embedding dimensions and $\alpha$ is the margin, the amount by which negative examples have to be further away from the anchor than positive examples.
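The loss translates directly into NumPy; the example embeddings and margin value below are made up for illustration:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Squared-distance triplet loss with margin alpha."""
    d_pos = np.sum((f_a - f_p) ** 2)   # anchor-positive distance
    d_neg = np.sum((f_a - f_n) ** 2)   # anchor-negative distance
    return max(d_pos - d_neg + alpha, 0.0)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])   # close to the anchor -> small positive distance
n = np.array([0.0, 1.0])   # far from the anchor -> large negative distance
print(triplet_loss(a, p, n))   # 0.0, this triplet is already well separated
```

Triplets that already satisfy the margin contribute zero loss, so training effort concentrates on the "hard" triplets.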

multilingual sentence transformer¶

This is the publication on this ingenious idea.

The student model is XLM-R from Facebook.

Large Language Models (LLMs)¶

Demand for computing power is growing faster than predicted by Moore's Law¶

FLOPS: FLoating point OPerations per Second. What is a PetaFLOPS (PFLOPS)? $10^{15}$ operations per second.
"To match what a 1 PFLOPS computer system can do in just one second, you'd have to perform one calculation every second for 31,688,765 years." taken from here, pics are taken from here

The number of parameters is growing faster than the memory of the accelerators¶

The size of the Transformers grows 240 times every 2 years.

pics are taken from [here](https://github.com/amirgholami/ai_and_memory_wall/tree/main/imgs/pngs)

taken from this podcast with George Hotz

Carbon Footprint of Neural Network Training¶

The new reality in Data Science¶

Han Xiao, 2019 Founder and CEO of Jina AI

The original Transformer architecture¶

taken from: https://arxiv.org/pdf/1706.03762.pdf

Important things to note are:

  • self-attention
  • positional encoding
  • skip-connections
  • encoder-decoder architecture

taken from https://medium.com/machine-intelligence-and-deep-learning-lab/transformer-the-self-attention-mechanism-d7d853c2c621

Why Self-Attention?

taken from: https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html

Self-attention allows the model to treat words with weighted importance:

  • key and query interact like a similarity measure (akin to cosine similarity): every word's key is compared with the query, and from these comparisons the attention weights are computed
  • the output z is the input weighted by these attention weights

How Self-Attention works¶

taken from http://lucasb.eyer.be/transformer

$$\text{attention(Q, K, V)} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
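The formula translates almost one-to-one into NumPy. The shapes below are made up for illustration; real models add masking, batching, and multiple heads:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V                               # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 tokens, d_k = 4
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = attention(Q, K, V)
print(out.shape)   # (3, 4): one output vector per input token
```

If a query is equally similar to all keys, the softmax is uniform and the output is just the average of the values, which shows why the weights act as an importance distribution over the input.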

Multi-Head Self-Attention

Byte-Pair-encoding Tokenization¶

Transformer models do not see single characters:

taken from https://platform.openai.com/tokenizer
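A toy version of the byte-pair-encoding training loop: repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. The corpus is the classic example from the Sennrich et al. BPE paper; the function names are ours:

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every occurrence of an adjacent symbol pair with its merge."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def bpe_merges(words, num_merges):
    """Learn the num_merges most frequent pair merges from a word list."""
    vocab = {tuple(w): c for w, c in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = bpe_merges(corpus, 3)
print(merges)
```

Frequent character sequences like "est" quickly become single tokens, while rare words stay split into smaller pieces; this is why common words are one token each in the OpenAI tokenizer while unusual strings fall apart into fragments.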

Architecture of some important Transformer models

taken from Lucas Beyer again

Training¶

taken from Andrej Karpathy

Datasets used for training¶

Colossal Cleaned Crawled Corpus (C4): 800GB of cleaned Common Crawl web text. https://github.com/google-research/text-to-text-transfer-transformer#c4

BookCorpus "The books have been crawled from https://www.smashwords.com, see their terms of service for more information."

Stack-Exchange preferences

Instruction Data-Sets

taken from Lilian Weng

Data-Sets "stolen" from ChatGPT

The project can be found here

Reinforcement Learning from Human Feedback (RLHF)

Chain-of-Thought (CoT) Training

Data the models were trained on¶

Lineage of Chat-GPT

taken from: How does GPT Obtain its Ability?

Google-Researcher "We Have No Moat, And Neither Does OpenAI"¶

LLaMA by Touvron et al., 2023

taken from Andrej Karpathy

Alpaca¶

Distilling ChatGPT: "Our data generation process results in 52K unique instructions and the corresponding outputs, which costed less than $500 using the OpenAI API."

Vicuna¶

"After fine-tuning Vicuna with 70K user-shared ChatGPT conversations, we discover that Vicuna becomes capable of generating more detailed and well-structured answers compared to Alpaca"
Costs: $300

Here is a link to the 'open source' models and their performance.

What did the open-source community solve?¶

LORA (Low-Rank-Adaptation)
This is the corresponding paper.

The pretrained weight matrix $\mathbf{W}$ is frozen during training. An additional weight update is learned through the two low-rank (rank $r$) matrices $\mathbf{A}$ and $\mathbf{B}$. Only these weights (orange) are updated. The input vector (dark blue) is multiplied with the frozen weights as well as with the low-rank adaptation of the weight matrix; the results are simply added.
During training, only the gradients for the orange matrices have to be kept in GPU memory.

taken from Sebastian Raschka
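A minimal sketch of the LoRA forward pass described above. Sizes are made up; real implementations additionally scale the low-rank path by a factor alpha/r:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                           # hidden size d, low rank r << d

W = rng.normal(size=(d, d))           # pretrained weights, frozen
A = rng.normal(size=(r, d)) * 0.01    # trainable low-rank factor
B = np.zeros((d, r))                  # trainable; starts at zero, so BA = 0
                                      # and training begins at the original model

def lora_forward(x):
    # frozen path plus low-rank path; the results are simply added
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
print(np.allclose(lora_forward(x), W @ x))   # True before any training step
# trainable parameters: 2*d*r instead of d*d, a large saving when d is big
```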

Bits and Bytes
Tim Dettmers et al., 2022

QLora by Dettmers et al., 2023
In a few words, QLoRA reduces the memory usage of LLM finetuning without performance tradeoffs compared to standard 16-bit model finetuning. This method enables 33B model finetuning on a single 24GB GPU and 65B model finetuning on a single 48GB GPU. see here

illustration taken from here

SpQR by Dettmers et al., 2023 This is not for training but for inference

"Specifically, we provide an efficient GPU inference algorithm for SpQR which yields faster inference than 16-bit baselines at similar accuracy, while enabling memory compression gains of more than 4x."

LLaMA cpp¶

Plain C/C++ implementation without dependencies

  • LLMs quantized on edge devices
  • on Apple hardware
  • no training, just inference

MLC¶

Prompt Engineering¶

Prompt engineer will not be a job that is here to stay.

taken from Andrej Karpathy

From Mishra et al., 2022:

  • Use Low-level Patterns: Instead of using terms that require background knowledge to understand, use various patterns about the expected output.
  • Itemizing Instructions: Turn descriptive attributes into bulleted lists. If there are any negation statements, turn them into assertion statements.
  • Break it Down: Break down a task into multiple simpler tasks, wherever possible.
  • Enforce Constraint: Add explicit textual statements of output constraints.
  • Specialize the Instruction: Customize the instructions so that they directly speak to the intended output.

Prompting serves only to imitate the training data as closely as possible.

Fiction
Since most models are also trained on one or several book corpora, they can also be prompted to take on a fictional persona.

Zero-Shot CoT prompting
Let's think step by step

Remember Byte-Pair-Encoding:

For more funny examples with "O" see this twitter feed.

Why Large Language Models cannot calculate with large numbers:

The ultimate solution:¶

Retrieval-Augmented Generation (RAG) process¶

The rise of vector databases

taken from here

benchmark can be found here on github
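The retrieval step of a RAG pipeline can be sketched in a few lines. The `embed` function below is a crude stand-in (hashed character trigrams); in a real system it would be a sentence-transformer, and the documents would live in a vector database:

```python
import numpy as np

def embed(text, dim=256):
    """Stand-in embedding: hash character trigrams into a unit vector.
    A real RAG system would use a sentence-transformer here."""
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

documents = [
    "GloVe learns word vectors from a co-occurrence matrix.",
    "LoRA adds low-rank adapters to frozen pretrained weights.",
    "Vicuna was fine-tuned on user-shared ChatGPT conversations.",
]
doc_vecs = np.stack([embed(d) for d in documents])

query = "How does LoRA adapt frozen weights?"
scores = doc_vecs @ embed(query)          # cosine similarity (vectors are unit norm)
best = documents[int(np.argmax(scores))]  # retrieved context for the LLM prompt
print(best)
```

The retrieved document is then pasted into the prompt as context, so the model can answer from it instead of relying on its (limited, frozen) parametric knowledge.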

New research is extending the context length

taken from the publication

User: Here is the first book of Harry Potter. (book in looong context) Read it and describe the main characters!

ChatBot:

  1. Harry Potter, the protagonist of the story, who is marked with a lightning scar and destined to fight the evil Lord Voldemort

  2. Hermione Granger, Harry’s best friend and a brilliant witch who excels at academics and logic

  3. Ron Weasley, Harry’s other best friend and a loyal and brave wizard from a large family

  4. Albus Dumbledore, the headmaster of Hogwarts and a powerful and wise wizard who guides and mentors Harry

  5. Lord Voldemort, the main antagonist of the story, who is a dark wizard who seeks to destroy Harry and conquer the wizarding world

User: Now, explain in more detail the role of Albus Dumbledore.

ChatBot: Can you give me the book again? I first have to read it.

research in progress: agents

There are specialized models for:

  • writing sql-queries
  • querying APIs (just feed them the documentation)
  • coding
  • generating images from text-prompts
  • generating real human voices (in your preferred language)
  • transcribing audio files
  • OCR
  • etc...

image is taken from here

How should a company like Contovista / Finnova position itself?¶

(summary of Leadership needs us to do Gen AI, what do we do?)

1. Set expectations:

  • Building cool demos with LLMs is easy
  • But it's hard to build a real product with LLMs
  • Build up some intern competencies (just experiment)
  • set goals and expectation for real products
  • be willing to invest

2. Minimize risk:

Analyze:

  • Can competitors with generative AI make me obsolete? How fast will competitors move?
    -> Go all in
  • Will I miss opportunities to boost revenue? -> Build or buy

Have a data strategy:

  • consolidate existing data
  • ensure data quality and data governance

Avoid big sweeping decisions:

  • don't change everything
  • invest in things that last
(image: is_the_case_worth_it.png)

https://www.linkedin.com/posts/yangpeter_everyones-pivoting-to-generative-ai-but-activity-7073305428255789056-l9Ns/?utm_source=share&utm_medium=member_android

Here are 5 questions to ask to understand if a gen AI product will be successful:

1/ If you took the word "AI" out, is the product still solving a customer problem?

AI is a solution, not a problem.

Ask yourself:

  1. What is the pain point?
  2. How many users share this pain?
  3. Is the pain big enough to take action?
  4. Is the pain underserved by non-AI tools?

2/ How accurate does the solution need to be?

Plot the problem on a fluency vs. accuracy grid.

Gen AI today is great for high fluency + low accuracy problems (e.g., productivity).

It's not great for solutions that need high accuracy (e.g., financial decisions).

3/ How fast will incumbents move?

Incumbents like Microsoft, Google, and Adobe have moved incredibly fast on AI.

Startups that overlap with core incumbent use cases might struggle.

e.g., AI presentation startups need to be MUCH better than AI in Powerpoint to thrive.

4/ Is there a moat?

Example moats include:

  • Access to proprietary data and models
  • Exclusive contracts with large customers
  • Great product even without AI
  • Exceptional talent in the selected field
  • Business models that incumbents avoid

And of course...speed of execution.

5/ Is it overvalued?

If an AI product already has $100M+ valuation, you should think:

Can it continue to grow and (more importantly) retain users?

In a crowded space like AI copywriting and productivity - that could get hard.

6/ To recap, here are 5 questions to ask to evaluate AI products and companies:

  1. Without "AI", is it still solving a problem?
  2. How accurate does the solution need to be?
  3. How fast will incumbents move?
  4. Is there a moat?
  5. Is it overvalued?

7/ I hope these questions also help builders who are thinking of creating new AI products.